Large Vocabulary Audio-Visual Speech Recognition Using Active Shape Models
Authors
Abstract
Orthogonal information present in the video signal associated with the audio helps improve the accuracy of a speech recognition system. Audio-visual speech recognition involves extracting both audio and visual features from the input signal. Visual parameters are extracted by recognizing speech-dependent features in the video sequence. This paper uses geometrical features to describe lip shapes, with curve-based Active Shape Models used to extract the geometry. These geometrically represented visual parameters are combined with the audio cepstral features to perform audio-visual classification. It is shown that the bimodal system presented here improves classification results over classification using the audio features alone.
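As a rough illustration of the fusion step described in this abstract, the sketch below reduces ASM lip landmarks to simple geometric parameters, concatenates them with per-frame audio cepstra, and scores frames with per-class Gaussian mixtures. The feature definitions, dimensions, landmark count, and the GaussianMixture classifier are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of audio-visual feature fusion: ASM-derived lip
# geometry concatenated with per-frame MFCCs, classified with one GMM per
# class. All dimensions and names below are assumptions for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

def lip_geometry(landmarks):
    """Reduce ASM lip landmarks (N x 2) to mouth width, height, and aspect ratio."""
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    width = xs.max() - xs.min()
    height = ys.max() - ys.min()
    return np.array([width, height, height / (width + 1e-8)])

def fuse(mfcc_frames, landmark_frames):
    """Concatenate per-frame cepstra with per-frame lip geometry.
    Assumes both streams are already interpolated to a common frame rate."""
    visual = np.stack([lip_geometry(lm) for lm in landmark_frames])
    return np.hstack([mfcc_frames, visual])

# Toy data: 100 frames, 13 MFCCs, 20 lip landmarks per frame, 2 classes.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))
landmarks = rng.normal(size=(100, 20, 2))
labels = rng.integers(0, 2, size=100)

features = fuse(mfcc, landmarks)
models = [GaussianMixture(n_components=2, random_state=0).fit(features[labels == c])
          for c in (0, 1)]
log_likelihoods = np.stack([m.score_samples(features) for m in models], axis=1)
print("frame accuracy on toy data:", (log_likelihoods.argmax(axis=1) == labels).mean())
```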
Similar Resources
Large-vocabulary audio-visual speech recognition: a summary of the Johns Hopkins Summer 2000 Workshop
We report a summary of the Johns Hopkins Summer 2000 Workshop on audio-visual automatic speech recognition (ASR) in the large-vocabulary, continuous speech domain. Two problems of audio-visual ASR were mainly addressed: Visual feature extraction and audio-visual information fusion. First, image transform and model-based visual features were considered, obtained by means of the discrete cosine t...
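As a minimal sketch of the image-transform style of visual feature mentioned in this summary, the snippet below takes a 2-D DCT of a grayscale mouth region of interest and keeps a low-frequency block of coefficients; the ROI size and the number of retained coefficients are assumptions.

```python
# Illustrative DCT-based visual feature: 2-D DCT of a grayscale mouth ROI,
# keeping a low-frequency block of coefficients (sizes are assumptions).
import numpy as np
from scipy.fft import dctn

def dct_visual_features(mouth_roi, n_coeffs=30):
    coeffs = dctn(mouth_roi, norm="ortho")
    return coeffs[:8, :8].ravel()[:n_coeffs]  # simple low-frequency block

roi = np.random.default_rng(1).random((32, 48))  # stand-in for a real mouth crop
print(dct_visual_features(roi).shape)  # (30,)
```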
Large Vocabulary Audio-Visual Speech Recognition Using the Janus Speech Recognition Toolkit
This paper describes audio-visual speech recognition experiments on a multi-speaker, large vocabulary corpus using the Janus speech recognition toolkit. We describe a complete audio-visual speech recognition system and present experiments on this corpus. By using visual cues as additional input to the speech recognizer, we observed good improvements, both on clean and noisy speech in our experi...
Asynchronous stream modeling for large vocabulary audio-visual speech recognition
This paper addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams and propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be tr...
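A minimal sketch of the kind of composite state space such a model implies, assuming each stream has its own left-to-right states and asynchrony between streams is limited to a fixed number of states (both the constraint and its value are assumptions):

```python
# Illustrative composite-HMM state space: pair audio and video states, but
# only keep pairs whose indices differ by at most max_async states.
from itertools import product

def composite_states(n_audio_states, n_video_states, max_async=1):
    return [(a, v) for a, v in product(range(n_audio_states), range(n_video_states))
            if abs(a - v) <= max_async]

print(composite_states(3, 3, max_async=1))
# [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
```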
Using Likelihood L-statistic as Confidence Measure in Audio-visual Speech Recognition
This paper describes recent work on decision fusion in audio-visual speech recognition. In this work, a novel approach is proposed to combine information from the audio and video channels in an audio-visual speech recognition scenario. For simplicity, we have only considered the frame-level phonetic classification problem using two single-stream Gaussian Mixture Models (GMMs). Audio and video streams are adaptive...
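A minimal sketch of the frame-level fusion this summary describes, with one GMM per class and per stream and a weighted sum of stream log-likelihoods; the fixed weight below stands in for the adaptive, confidence-based weighting and is an assumption.

```python
# Illustrative frame-level decision fusion of two single-stream GMMs per class.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n_frames, n_classes = 200, 3
audio = rng.normal(size=(n_frames, 13))   # per-frame audio features
video = rng.normal(size=(n_frames, 6))    # per-frame visual features
labels = rng.integers(0, n_classes, size=n_frames)

audio_gmms = [GaussianMixture(n_components=2, random_state=0).fit(audio[labels == c])
              for c in range(n_classes)]
video_gmms = [GaussianMixture(n_components=2, random_state=0).fit(video[labels == c])
              for c in range(n_classes)]

lam = 0.7  # audio stream weight; 1 - lam goes to the video stream (assumed fixed)
audio_ll = np.stack([g.score_samples(audio) for g in audio_gmms], axis=1)
video_ll = np.stack([g.score_samples(video) for g in video_gmms], axis=1)
fused = lam * audio_ll + (1.0 - lam) * video_ll
print("fused frame accuracy (toy data):", (fused.argmax(axis=1) == labels).mean())
```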
Improving lip-reading performance for robust audiovisual speech recognition using DNNs
This paper presents preliminary experiments using the Kaldi toolkit [1] to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular we use a single-speaker large vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The models trained using the Kal...